Abstract — Block diagonalization (BD) is an attractive technique that
transforms the multi-user multiple-input multiple-output (MU-MIMO) channel
into parallel single-user MIMO (SU-MIMO) channels with zero inter-user
interference (IUI). In this paper, we combine the BD technique with two
deterministic vector perturbation (VP) algorithms that reduce the transmit
power in MU-MIMO systems with linear precoding. These techniques are the
fixed-complexity sphere encoder (FSE) and the QR-decomposition with
M-algorithm encoder (QRDM-E). In contrast to the conventional BD-VP
technique, which is based on the sphere encoder (SE), the proposed
techniques have fixed complexity, and a tradeoff between performance and
complexity can be achieved by controlling the size of the set of candidates
for the perturbation vector. Simulation results and analysis demonstrate the
suitability of the proposed techniques for next-generation mobile
communication systems, which are limited in latency and computational
complexity. In an MU-MIMO system with 4 users, each equipped with 2 receive
antennas, simulation results show that the proposed BD-FSE and BD-QRDM-E
outperform the conventional BD-THP (Tomlinson-Harashima precoding) by 5.5
and 7.4 dB, respectively, at a target BER of $10^{-4}$.

Index Terms — Multi-user MIMO, Vector Perturbation, Block 
Diagonalization, Sphere Encoder, QRD-M Encoder. 

I. Introduction 

Single-user multiple-input multiple-output (SU-MIMO) techniques, i.e.,
point-to-point links, have shown tremendous capacity gains without requiring
additional frequency-time resources [1]. In practice, each base station (BS)
communicates simultaneously with a large number of users using scarce
spectral resources. Therefore, multi-user MIMO (MU-MIMO) techniques are
required to enable BSs to communicate with multiple users on the same
frequency band and at the same time instant [2]. In analogy with the SU-MIMO
case, it has been shown that when $n_T$ antennas are used to transmit to
$n_U$ users with $n_R$ antennas each, the downlink sum capacity grows
linearly with $\min(n_T, n_U \times n_R)$ [3].

To achieve the maximum sum capacity at the downlink of MU-MIMO systems,
several approaches relying on the information-theoretic principle of
dirty-paper coding (DPC) were proposed. DPC was initially proposed by Costa,
who showed that the capacity of an interference channel with known
interference is exactly the same as that of the interference-free channel
[4]. In communication systems, the signal transmitted to one user can be
seen as interference for the other users. Because this interference is known
to the BS and the channel can be fed back by the users, inter-user
interference (IUI) can be canceled, or highly reduced, by means of MU-MIMO
precoding.

Channel inversion and regularized channel inversion are the simplest
precoding techniques [5]. These approaches are sometimes referred to as
zero-forcing (ZF) and minimum-mean square error (MMSE) precoding,
respectively. When the channel matrix is ill-conditioned, channel inversion
precoding requires high transmission power, leading to degradation in the
error performance. Although regularized inversion precoding improves the
conditioning of the precoding matrix, and thus reduces the required transmit
power, its error performance is still mediocre.
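To illustrate the power penalty of plain channel inversion on an
ill-conditioned channel, the following is a minimal NumPy sketch. The 2x2
channel values, the data vector, and the regularization value 0.1 are
assumptions chosen only for illustration, and the function names are
hypothetical.

```python
import numpy as np

def zf_precoder(H):
    """Channel inversion (zero-forcing) for a square channel: G = H^{-1}."""
    return np.linalg.inv(H)

def mmse_precoder(H, alpha):
    """Regularized channel inversion: G = (H^H H + alpha I)^{-1} H^H."""
    n = H.shape[1]
    Hh = H.conj().T
    return np.linalg.solve(Hh @ H + alpha * np.eye(n), Hh)

# Nearly singular channel (hypothetical values): the two rows are almost equal.
H = np.array([[1.0, 0.99],
              [0.99, 1.0]])
s = np.array([1.0, -1.0])  # data vector aligned with the weak channel direction

# Unnormalized transmit power ||G s||^2 required by each precoder.
p_zf = np.linalg.norm(zf_precoder(H) @ s) ** 2
p_mmse = np.linalg.norm(mmse_precoder(H, alpha=0.1) @ s) ** 2
print(p_zf, p_mmse)
```

With this nearly singular channel, the inverted precoder needs a transmit
power several orders of magnitude larger than the regularized one. The
regularized precoder pays for this by leaving some residual interference,
since it no longer inverts the channel exactly.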

Tomlinson-Harashima precoding (THP) limits the transmit power by introducing
the non-linear modulo operation [6], [7]. As a consequence,
out-of-constellation points at the output of the precoder are folded back
into a pre-defined range. A linearized version of the THP that consists of a
vector perturbation stage and an IUI cancellation stage was presented in
[8]. The vector perturbation stage perturbs the data vector such that the
transmit power is reduced. Then, the transmitted signal can be recovered at
the receivers by the same modulo operation. The IUI cancellation stage can
be done either successively or using any of the aforementioned linear
precoders. Notice that the aforementioned precoding approaches assume that a
single stream is transmitted to each user.

The block diagonalization (BD) algorithm, which transforms the MU-MIMO link
into parallel SU-MIMO links, supports multi-stream transmission [9]. BD uses
a precoding matrix that ensures zero IUI; consequently, the users' data are
processed in parallel, leading to a reduction in the processing time at the
BS side.

In [10], a combination of MMSE-THP and the BD scheme was introduced to
improve the system performance. Although THP outperforms the matrix
inversion precoding scheme and its regularized form, the obtained
perturbation vector is not optimum. This implies that a further reduction in
the required transmit power can be achieved if the perturbation vector is
optimized.

Related works. In [11], the idea of vector perturbation was introduced for
single-antenna decentralized users, where the perturbation problem is solved
using the sphere encoder (SE). In [12] and [13], the vector perturbation
technique is generalized to users with multiple receive antennas. This is
accomplished by employing the BD algorithm.

Contributions. Our contributions are summarized as follows: 

• We discuss the computational complexity and latency issues of the SE, and
its applicability in the downlink of multi-user MIMO systems with multiple
receive antennas per user.

• To overcome the drawbacks of the SE, we propose two deterministic BD
vector perturbation techniques. These techniques are the fixed-complexity
sphere encoder (FSE) [14] and the QR-decomposition with M-algorithm encoder
(QRDM-E) [15], each combined with the BD algorithm. Furthermore, we optimize
the size of the list of candidates for the perturbation vector stage such
that a tradeoff between performance and complexity is achieved.

The rest of this paper is organized as follows. In Section II, we introduce
the system model and the BD technique. In Section III, the proposed BD
vector perturbation techniques are introduced in detail, and the
conventional BD vector perturbation with SE is reviewed. Simulation results
and discussions are presented in Section IV, and conclusions are drawn in
Section V.



II. System Model for MU-MIMO with Block Diagonalization

We consider the downlink of an MU-MIMO system, i.e., transmission from the
base station to the users, with $n_T$ transmit antennas and $n_R$ receive
antennas per user. We assume that $n_T = (n_R \times n_U)$, where $n_U$ is
the number of users. Under the assumption of a narrow-band flat-fading
channel, the MU-MIMO channel matrix
$\mathbf{H} \in \mathbb{C}^{n_R n_U \times n_T}$ is given by

$$\mathbf{H} = [\mathbf{H}_1^T \; \mathbf{H}_2^T \; \cdots \; \mathbf{H}_{n_U}^T]^T, \tag{1}$$

where $\mathbf{H}_i \in \mathbb{C}^{n_R \times n_T}$ is the channel coupling
the $n_T$ transmit antennas to the $n_R$ receive antennas of user $i$, and
$(\cdot)^T$ denotes the matrix transpose.



A. Block Diagonalization 

The inter-user interference (IUI) can be fully canceled out using the BD
algorithm. BD therefore transforms the MU-MIMO channel into parallel
single-user MIMO (SU-MIMO) channels. The inter-symbol interference (ISI)
among the symbols belonging to a given user can be removed either at the
transmitter by means of precoding, or at the receiver by employing spatial
demultiplexing, i.e., detection. To reduce the complexity of the users'
receivers, we consider that the channel effect is equalized at the
transmitter side by means of precoding.

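The purpose of the BD algorithm is to find
$\mathbf{B} \in \mathbb{C}^{n_T \times n_T}$ such that

$$\mathbf{H}\mathbf{B} = \begin{bmatrix} \mathbf{H}_{\text{eff},1} & \mathbf{0} & \cdots & \mathbf{0} \\ \mathbf{0} & \mathbf{H}_{\text{eff},2} & \cdots & \mathbf{0} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{0} & \mathbf{0} & \cdots & \mathbf{H}_{\text{eff},n_U} \end{bmatrix}, \tag{2}$$

where $\mathbf{0}$ is the $n_R \times n_R$ zero matrix and
$\mathbf{H}_{\text{eff},i} = \mathbf{H}_i\mathbf{B}_i$ is the effective
channel matrix of user $i$ after the BD. To this end, we define the matrix

$$\tilde{\mathbf{H}}_i = [\mathbf{H}_1^T \; \cdots \; \mathbf{H}_{i-1}^T \; \mathbf{H}_{i+1}^T \; \cdots \; \mathbf{H}_{n_U}^T]^T, \tag{3}$$

which is obtained by simply removing the channel matrix of user $i$ from the
system channel matrix $\mathbf{H}$. The singular value decomposition (SVD)
of $\tilde{\mathbf{H}}_i$ is formed as follows: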

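$$\tilde{\mathbf{H}}_i = \tilde{\mathbf{U}}_i \tilde{\boldsymbol{\Sigma}}_i [\tilde{\mathbf{V}}_i^{(1)} \; \tilde{\mathbf{V}}_i^{(0)}]^H, \tag{4}$$

where the columns of $\tilde{\mathbf{V}}_i^{(0)}$ are the right singular
vectors corresponding to the zero singular values of $\tilde{\mathbf{H}}_i$.
Since the columns of $\tilde{\mathbf{V}}_i^{(0)}$ lead to zero IUI, they are
potential beamformers for user $i$. Therefore, a linear combination of these
vectors is found to form the beamforming matrix $\mathbf{B}_i$. To
accomplish this, the SVD of $\mathbf{H}_i\tilde{\mathbf{V}}_i^{(0)}$ is
formed as follows:

$$\mathbf{H}_i\tilde{\mathbf{V}}_i^{(0)} = \mathbf{U}_i \boldsymbol{\Sigma}_i [\mathbf{V}_i^{(1)}]^H, \tag{5}$$

where $\mathbf{H}_i\tilde{\mathbf{V}}_i^{(0)}$ is considered to be
full-rank. $\mathbf{B}_i$ is then equal to
$\tilde{\mathbf{V}}_i^{(0)}\mathbf{V}_i^{(1)}$, and the transmit beamforming
matrix $\mathbf{B} \in \mathbb{C}^{n_T \times n_T}$ is given by:

$$\mathbf{B} = [\mathbf{B}_1 \; \mathbf{B}_2 \; \cdots \; \mathbf{B}_{n_U}]. \tag{6}$$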

Notice that when the number of users increases for a fixed $n_T$, the
degrees of freedom at the base station are spent in the IUI nulling process,
leading to a reduction in the transmit diversity of the array.
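The two-SVD construction described above can be sketched numerically as
follows. This is a minimal NumPy illustration under stated assumptions: the
function name `block_diagonalize`, the rank tolerance, and the random test
channel are hypothetical choices, not part of the original text.

```python
import numpy as np

def block_diagonalize(H, n_R, n_U):
    """Return B = [B_1 ... B_nU] such that H_i @ B_j = 0 for i != j."""
    blocks = []
    for i in range(n_U):
        rows = np.s_[i * n_R:(i + 1) * n_R]
        H_i = H[rows, :]
        # Remove user i's rows to obtain the interference channel of user i.
        H_tilde = np.delete(H, rows, axis=0)
        # Right singular vectors beyond the rank span the null space.
        _, sv, Vh = np.linalg.svd(H_tilde)
        rank = int(np.sum(sv > 1e-10))
        V0 = Vh[rank:, :].conj().T          # n_T x (n_T - rank)
        # Second SVD: combine the null-space vectors into B_i.
        _, _, Vh_i = np.linalg.svd(H_i @ V0)
        V1 = Vh_i[:n_R, :].conj().T
        blocks.append(V0 @ V1)
    return np.hstack(blocks)

# Random (8, 2, 4) configuration: 8 transmit antennas, 4 users, 2 antennas each.
rng = np.random.default_rng(0)
n_T, n_R, n_U = 8, 2, 4
H = (rng.standard_normal((n_R * n_U, n_T))
     + 1j * rng.standard_normal((n_R * n_U, n_T)))
B = block_diagonalize(H, n_R, n_U)
E = H @ B

# Zero out the diagonal blocks; what remains is the inter-user interference.
leak = E.copy()
for i in range(n_U):
    leak[i * n_R:(i + 1) * n_R, i * n_R:(i + 1) * n_R] = 0
print(np.max(np.abs(leak)))
```

The printed leakage should be at the level of floating-point rounding, which
corresponds to the zero-IUI property of the effective channel.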

III. The Proposed Block Diagonalization Vector Perturbation

Applying the BD algorithm to $\mathbf{H}$ transforms the MU-MIMO channel
into $n_U$ parallel SU-MIMO channels with zero IUI. Therefore, user streams
can be processed in parallel, leading to a reduction in the precoding
latency. Fig. 1 shows the resulting end-to-end SU-MIMO system for user $i$.
The details of Fig. 1 are provided in the following subsections. In this
paper, all users' symbols are drawn from the same constellation set. Hence,
we do not employ any power loading technique, and the beamforming matrix of
user $i$ is equivalent to $\mathbf{B}_i$. In the sequel, we consider the
data processing for user $i$, which is identical to that of the other users.
Also, we consider that the SU-MIMO system for user $i$ is mapped to the
$N$-dimensional real Euclidean space for $N = 2n_R$, where real vectors and
matrices are represented by italic boldface symbols.

Fig. 1. Structure of the broadcast MIMO precoding system using block
diagonalization and vector perturbation techniques for user i.

A. Vector Perturbation for MU-MIMO Systems and a Review 
of the Sphere Encoder 

The goal of the vector perturbation techniques is to generate the vector
$\tilde{\boldsymbol{s}}_i$ from the data vector $\boldsymbol{s}_i$, such
that the norm of $\boldsymbol{G}_{\text{eff},i}\tilde{\boldsymbol{s}}_i$
becomes smaller than that of $\boldsymbol{G}_{\text{eff},i}\boldsymbol{s}_i$.
Here, $\boldsymbol{G}_{\text{eff},i}$ is the precoding matrix for user $i$.
The precoding matrix $\boldsymbol{G}_{\text{eff},i}$ equals
$\boldsymbol{H}_{\text{eff},i}^{-1}$ and
$(\boldsymbol{H}_{\text{eff},i}^T\boldsymbol{H}_{\text{eff},i} + \alpha\boldsymbol{I})^{-1}\boldsymbol{H}_{\text{eff},i}^T$
for the linear ZF and MMSE precoders, respectively. The regularization
factor $\alpha$ equals $(N\sigma^2/P_i)$, where $\sigma^2$ and $P_i$ are the
single-sided noise variance and the total transmit power for user $i$. In
other words, the vector perturbation technique is employed to reduce the
required transmission power. The perturbed vector $\tilde{\boldsymbol{s}}_i$
is then derived from the THP technique as follows:

$$\tilde{\boldsymbol{s}}_i = \boldsymbol{s}_i + \tau\boldsymbol{t}, \tag{7}$$

where $\tau$ is an integer that depends on the used modulation scheme, and
$\boldsymbol{t}$ is an $N$-dimensional integer vector. In [11], $\tau$ is
given by:

$$\tau = 2(|c_{\max}| + \Delta/2), \tag{8}$$

where $|c_{\max}|$ is the absolute value of the symbol with the largest
magnitude, and $\Delta$ is the spacing between any two neighboring symbols.
Since the precoding equalizes the channel effect, the received vector at
user $i$ is given by:

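$$\boldsymbol{y}_i = \tilde{\boldsymbol{s}}_i + \boldsymbol{n}_i = \boldsymbol{s}_i + \tau\boldsymbol{t} + \boldsymbol{n}_i, \tag{9}$$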

where $\boldsymbol{n}_i$ is the additive white Gaussian noise vector with
covariance matrix $\sigma^2\boldsymbol{I}$. At the receiver, the original
data vector $\boldsymbol{s}_i$ is recovered, without knowledge of the vector
$\boldsymbol{t}$, using the non-linear modulo operation as follows:

$$\hat{\boldsymbol{s}}_i = \text{mod}(\boldsymbol{y}_i), \tag{10}$$

where $\text{mod}(\cdot)$ is the modulo operation that reduces the range of
the received signal to the interval $[-K, K)$, where $K$ depends on the used
modulation scheme [13]. Specifically, $K = \sqrt{|\Omega|}$, where
$|\Omega|$ is the cardinality of the modulation set $\Omega$. For instance,
$K = 2$ and 4 for the QPSK and 16-QAM modulation schemes, respectively.
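As a small worked example of (8) and the modulo recovery, the sketch below
assumes real-valued alphabets with unit-spaced odd levels, i.e., {-1, 1}
per dimension for QPSK and {-3, -1, 1, 3} for 16-QAM. Under that assumption
(8) gives $\tau = 4$ and $\tau = 8$, which matches $2K$ for $K = 2$ and
$K = 4$. The function names, the perturbation vector, and the noise values
are hypothetical.

```python
import numpy as np

def tau_for(levels):
    """tau = 2 * (|c_max| + Delta / 2) for a real per-dimension alphabet."""
    levels = np.sort(np.asarray(levels, dtype=float))
    c_max = np.max(np.abs(levels))
    delta = np.min(np.diff(levels))      # spacing between neighboring symbols
    return 2.0 * (c_max + delta / 2.0)

def mod_tau(y, tau):
    """Fold each entry of y into the interval [-tau/2, tau/2)."""
    return y - tau * np.floor(y / tau + 0.5)

tau = tau_for([-1, 1])                          # QPSK, per real dimension
s = np.array([1.0, -1.0, 1.0, 1.0])             # data vector
t = np.array([2, -1, 0, -3])                    # arbitrary integer perturbation
noise = np.array([0.1, -0.2, 0.05, 0.0])        # small illustrative noise
y = s + tau * t + noise                         # received vector, cf. (9)

# The receiver removes tau * t without knowing t, then slices to +/-1.
s_hat = np.sign(mod_tau(y, tau))
print(s_hat)
```

The receiver never needs the perturbation vector itself: any integer
multiple of $\tau$ is removed by the folding step as long as the noise does
not push the sample across a decision boundary.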



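The vector $\boldsymbol{t}$, introduced in (7), is found by solving the
following $N$-dimensional integer lattice problem:

$$\boldsymbol{t} = \arg\min_{\boldsymbol{t}'} \left\{ (\boldsymbol{s}_i + \tau\boldsymbol{t}')^T \boldsymbol{G}_{\text{eff},i}^T \boldsymbol{G}_{\text{eff},i} (\boldsymbol{s}_i + \tau\boldsymbol{t}') \right\} = \arg\min_{\boldsymbol{t}'} \left\| \boldsymbol{G}_{\text{eff},i}(\boldsymbol{s}_i + \tau\boldsymbol{t}') \right\|^2. \tag{11}$$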

In [12] and [13], the authors outlined the benefit of employing the BD
algorithm in transforming the $(N \times n_U)$-dimensional lattice search
problem into $n_U$ parallel $N$-dimensional search problems. Clearly, this
leads to a reduction in the precoding latency. However, we state the
following remarks about using the SE to solve the lattice problem in (11):

1) Sphere encoder: The authors used the sphere encoder (SE), which was
originally introduced by Hochwald et al. in [11]. Although the SE achieves a
great reduction in the required transmission power, it suffers from the
following drawbacks:

• Worst-case complexity: Although the average complexity of the SE is
polynomial in the problem size [16], its worst-case complexity is
exponential, i.e., comparable to that of the brute-force search.¹ Therefore,
in communication systems with limited computational resources, the SE
becomes inapplicable.

• Maximum latency: Because the tree-search phase of the SE is sequential,
the possibility of efficient hardware implementation by pipelining is
limited.

• User-dependent precoding latency: Because the computational complexity of
the SE depends, among other factors, on the conditioning of
$\boldsymbol{H}_{\text{eff},i}$, the precoding latency may differ from one
user to another. Hence, the vector perturbation stage for a user with an
ill-conditioned effective channel matrix is more time consuming than that
for a user with a well-conditioned channel matrix. Therefore, the processing
latency at the transmitter for precoding a data vector is equivalent to the
maximum latency to precode $\boldsymbol{s}_i$, for $i = 1, 2, \ldots, n_U$.
Thus, additional latency overhead is introduced at the transmitter side due
to the random complexity of the vector perturbation stage using the SE.

¹ In fact, Jaldén and Ottersten have shown in [17] that even the average
computational complexity of the sphere decoder is exponential in the problem
size for a fixed signal-to-noise ratio.

2) Choice of the set of candidates for t: Because the elements of
$\boldsymbol{t}$ can take any integer value, the set of candidates for
$\boldsymbol{t}$ should be a truncated subset of $\mathbb{Z}$ to restrict
both the precoding latency and the computational complexity. The authors in
[11], [12], and [13] did not impose any restriction on the size of the set
of candidates, leading to a huge worst-case complexity.

B. Problem Formulation 

To solve (11) successively in the case of ZF precoding, the LQ factorization
of the effective channel matrix $\boldsymbol{H}_{\text{eff},i}$ is required.
Let the transpose of $\boldsymbol{H}_{\text{eff},i}$ be factorized into the
product of a unitary matrix $\boldsymbol{Q}$ and an upper triangular matrix
$\boldsymbol{R}$; then, the search problem in (11) can be simplified to:

(12)



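where $s_n$ is the $n$-th element of $\boldsymbol{s}_i$, and the lower
triangular matrix $\boldsymbol{L}$ equals $(\boldsymbol{R}^{-1})^T$. Also,
$\hat{t}_k$ and $t_k$ represent the retained candidate and a possible
candidate for $\boldsymbol{t}$, respectively, at the $k$-th precoding level.
In the case of the MMSE precoder, the extended matrix
$\boldsymbol{H}_{\text{ext},i} = [\boldsymbol{H}_{\text{eff},i} \; \sqrt{\alpha}\boldsymbol{I}]^T$
is factorized, where $\boldsymbol{L}$ again equals
$(\boldsymbol{R}^{-1})^T$. Due to the QR property

$$\boldsymbol{H}_{\text{ext},i} = \boldsymbol{Q}\boldsymbol{R} = \begin{bmatrix} \boldsymbol{Q}_1 \\ \boldsymbol{Q}_2 \end{bmatrix} \boldsymbol{R} = \begin{bmatrix} \boldsymbol{Q}_1\boldsymbol{R} \\ \boldsymbol{Q}_2\boldsymbol{R} \end{bmatrix}, \tag{13}$$

it holds that $\boldsymbol{R}^{-1} = \boldsymbol{Q}_2/\sqrt{\alpha}$. Since
$\sqrt{\alpha}$ is by definition a strictly positive real number, it does
not affect the search result in (12). Therefore,
$\boldsymbol{L} = \boldsymbol{Q}_2^T$ also leads to the required
perturbation without the need for explicitly inverting $\boldsymbol{R}$.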
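The two factorization properties used above can be checked numerically. The
sketch below is only a sanity check under assumed values (a random real 4x4
effective channel and a regularization factor of 0.1); the variable names
are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
N, alpha = 4, 0.1
H = rng.standard_normal((N, N))      # stand-in for the real effective channel
x = rng.standard_normal(N)           # stand-in for a perturbed data vector

# ZF case: H^T = Q R, so H^{-1} = Q L with L = (R^{-1})^T lower triangular.
Q, R = np.linalg.qr(H.T)
L = np.linalg.inv(R).T
norm_direct = np.linalg.norm(np.linalg.inv(H) @ x)
norm_via_L = np.linalg.norm(L @ x)   # Q is orthogonal and drops out of the norm
zf_rel_gap = abs(norm_direct - norm_via_L) / norm_direct

# MMSE case: factorize the extended matrix [H^T; sqrt(alpha) I] = [Q1; Q2] R.
H_ext = np.vstack([H.T, np.sqrt(alpha) * np.eye(N)])
Q_ext, R_ext = np.linalg.qr(H_ext)   # reduced QR: Q_ext is 2N x N
Q2 = Q_ext[N:, :]
mmse_gap = np.max(np.abs(np.linalg.inv(R_ext) - Q2 / np.sqrt(alpha)))

print(zf_rel_gap, mmse_gap)
```

Both gaps should be at rounding level, which confirms that the lower block
of the orthogonal factor can replace the explicit inverse of the triangular
factor up to the positive scalar.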

The elements of $\boldsymbol{t}$ are drawn from the symmetric integer set

$$\mathcal{A} = \{-a, \ldots, -1, 0, 1, \ldots, a\}, \tag{14}$$

where $a$ is a positive integer chosen to achieve a tradeoff between
performance and complexity of the vector perturbation stage. Hereafter,
$T = (2a + 1)$ denotes the number of elements of the set $\mathcal{A}$.

In the following subsections, we introduce the proposed BD vector
perturbation techniques that overcome the aforementioned drawbacks of the
BD-SE with a tolerable sub-optimality in solving the integer lattice problem
in (11).
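As a hedged illustration of why a triangular factor allows the lattice
problem to be searched level by level in the ZF case, the norm in (11) can
be expanded row by row. The expressions below follow only from the
definitions of the QR factors and of the lower triangular matrix given in
the problem formulation; they are a sketch of one consistent form and are
not claimed to be the exact expression of (12).

```latex
% ZF case: H_eff^T = Q R  =>  G_eff = H_eff^{-1} = Q L,  with L = (R^{-1})^T.
\begin{align}
\left\| \boldsymbol{G}_{\text{eff},i}(\boldsymbol{s}_i + \tau\boldsymbol{t}) \right\|^2
  &= \left\| \boldsymbol{Q}\boldsymbol{L}(\boldsymbol{s}_i + \tau\boldsymbol{t}) \right\|^2
   = \left\| \boldsymbol{L}(\boldsymbol{s}_i + \tau\boldsymbol{t}) \right\|^2 \\
  &= \sum_{k=1}^{N} \Big| \sum_{n=1}^{k} L_{k,n}\,(s_n + \tau t_n) \Big|^2 .
\end{align}
% Row k involves only t_1, ..., t_k, so a level-k decision given the
% retained candidates \hat{t}_1, ..., \hat{t}_{k-1} can be written as
\begin{equation}
\hat{t}_k = \arg\min_{t_k \in \mathcal{A}}
  \Big| L_{k,k}\,(s_k + \tau t_k) + \sum_{n=1}^{k-1} L_{k,n}\,(s_n + \tau \hat{t}_n) \Big|^2 .
\end{equation}
```

The sum over the levels processed so far plays the role of the accumulative
metric of a branch in the tree searches that follow.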

C. Fixed-complexity Sphere Encoder (FSE) 

The tree-search phase of the FSE algorithm consists of the following two
steps:

• Full expansion: At the first $p$ tree-search levels, the retained branches
are expanded to all possible nodes, and all the resulting branches are
retained for the next level.
• Single expansion: All retained branches from the preceding level are
independently expanded to all possible nodes. Then, the accumulative metrics
of the resulting branches are calculated using (12), and for each expanded
branch only the child with the smallest accumulative metric is retained for
the next level.

At the last search level, the metrics of the obtained perturbed vectors
$\tilde{\boldsymbol{s}}_1, \tilde{\boldsymbol{s}}_2, \ldots, \tilde{\boldsymbol{s}}_T$
are compared, and the vector that has the smallest metric is precoded and
transmitted.

Fig. 2. Example of the FSE for T = N = 3.

Fig. 2 depicts an example of the FSE for $T = N = 3$ and $p = 1$. At the
first level, i.e., $k = 1$, the root node is expanded to all possible
combinations $(s_1 + \tau t_k)$ for $t_k \in \{-1, 0, 1\}$. The metrics of
the resulting branches are calculated via (12), and all the branches are
retained for the next search level. Each retained branch at level 1 is
expanded to the three possible combinations $(s_2 + \tau t_k)$ for
$t_k \in \{-1, 0, 1\}$, and the branch with the smallest accumulative metric
is retained. This strategy is repeated at level 3, where the leaf, i.e.,
perturbed vector, that has the lowest accumulative metric is precoded and
transmitted.
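The full-expansion and single-expansion steps can be sketched in a few lines
of NumPy. This is a minimal illustration and not the authors'
implementation: it assumes the ZF case with a lower triangular matrix
obtained from a QR factorization, and it uses the row-by-row squared-norm
accumulation as the branch metric. The function name `fse_encode` and all
parameter values are hypothetical.

```python
import numpy as np

def fse_encode(L, s, tau, a, p=1):
    """Fixed-complexity search for t in {-a..a}^N reducing ||L (s + tau t)||^2."""
    N = len(s)
    cands = np.arange(-a, a + 1)
    branches = [([], 0.0)]               # (partial t, accumulative metric)
    for k in range(N):
        children = []
        for t_part, metric in branches:
            # Contribution of the already decided entries to row k.
            prev = sum(L[k, n] * (s[n] + tau * t_part[n]) for n in range(k))
            costs = metric + (prev + L[k, k] * (s[k] + tau * cands)) ** 2
            if k < p:
                keep = range(len(cands))         # full expansion: keep all
            else:
                keep = [int(np.argmin(costs))]   # single expansion: best child
            children += [(t_part + [int(cands[j])], float(costs[j])) for j in keep]
        branches = children
    # Compare the surviving leaves and return the best one.
    t_best, metric_best = min(branches, key=lambda b: b[1])
    return np.array(t_best), metric_best

rng = np.random.default_rng(3)
N, tau, a = 4, 4.0, 1
H = rng.standard_normal((N, N))          # stand-in for the effective channel
_, R = np.linalg.qr(H.T)
L = np.linalg.inv(R).T                   # lower triangular factor
s = rng.choice([-1.0, 1.0], size=N)

t_fse, m_fse = fse_encode(L, s, tau, a, p=1)     # T leaves are compared
t_full, m_full = fse_encode(L, s, tau, a, p=N)   # exhaustive reference
print(t_fse, m_fse, m_full)
```

Setting `p` equal to the number of levels turns the routine into an
exhaustive search over the candidate set, which is convenient for checking
how close the fixed-complexity result is to the best achievable metric.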

The advantages of using the BD-FSE over the conventional BD-SE are
summarized as follows:

• The BD-FSE algorithm has a fixed complexity that is independent of the
channel conditioning. A tradeoff between complexity and performance is
achieved by selecting an appropriate value for $a$. Therefore, for an equal
number of receive antennas per user, the latency of the vector perturbation
stage is the same for all the users. This avoids the time-wasting problem of
the conventional BD-SE, which arises when the latency of the vector
perturbation stage varies from one user to another.
• The vector perturbation stage of the BD-FSE is parallel, as shown in
Fig. 2, leading to efficient hardware implementation by pipelining. This
reduces the precoding latency, which is an important issue in communication
systems beyond the third generation (3G) [20].

For $(p = 1)$, the proposed FSE visits only $(N \times T)$ nodes to obtain
the perturbed vector, with a computational complexity much lower than that
of the QRDM-E, as will be shown in the next section.

D. QR-decomposition with M-algorithm Encoder (QRDM-E)

In QRDM-E, the best $M$ branches that have the least accumulative metrics
are retained at each encoding level. To accomplish a fair comparison with
the FSE, $M$ is set to $T$. Therefore, at the first tree-search stage, the
best $M$ branches are retained for level 2. At level 2, the retained
branches are expanded to all possible combinations of $(s_2 + \tau t_k)$.
The resulting $M^2$ branches are sorted according to their accumulative
metrics calculated via (12), and only the $M$ branches with the smallest
accumulative metrics are retained for level 3. This strategy is repeated up
to the last encoding level, where the perturbed vector
$\tilde{\boldsymbol{s}}_i$ that has the smallest accumulative metric is
precoded and transmitted.

Fig. 3. Example of the QRDM-E for T = N = M = 3.

Fig. 3 depicts an example of the QRDM-E for $T = N = M = 3$. At each
encoding level, only the three branches with the least accumulative metrics
are retained. In contrast to the SE, which visits $(M^N)$ nodes in the worst
case, QRDM-E has a fixed complexity, visiting only $(M + (N - 1)M^2)$ nodes.
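The M-algorithm search and its node count can be sketched as follows. This
is a minimal illustration under the same assumptions as a ZF-type search
with a lower triangular factor and a row-by-row squared-norm metric; the
function name `qrdm_encode`, the visited-node counter, and the parameter
values are hypothetical.

```python
import numpy as np

def qrdm_encode(L, s, tau, a, M):
    """M-algorithm search for t in {-a..a}^N; also counts the visited nodes."""
    N = len(s)
    cands = np.arange(-a, a + 1)
    branches = [([], 0.0)]               # (partial t, accumulative metric)
    visited = 0
    for k in range(N):
        children = []
        for t_part, metric in branches:
            # Contribution of the already decided entries to row k.
            prev = sum(L[k, n] * (s[n] + tau * t_part[n]) for n in range(k))
            costs = metric + (prev + L[k, k] * (s[k] + tau * cands)) ** 2
            children += [(t_part + [int(c)], float(m)) for c, m in zip(cands, costs)]
        visited += len(children)
        # Sort all children jointly and keep the M best for the next level.
        children.sort(key=lambda b: b[1])
        branches = children[:M]
    t_best, metric_best = branches[0]
    return np.array(t_best), metric_best, visited

rng = np.random.default_rng(4)
N, tau, a = 4, 4.0, 1
M = 2 * a + 1                            # M = T, as in the comparison with FSE
H = rng.standard_normal((N, N))          # stand-in for the effective channel
_, R = np.linalg.qr(H.T)
L = np.linalg.inv(R).T
s = rng.choice([-1.0, 1.0], size=N)

t_qrdm, m_qrdm, visited = qrdm_encode(L, s, tau, a, M)
print(t_qrdm, m_qrdm, visited)
```

For these values the counter gives 3 + 3 x 9 = 30 visited nodes, which
agrees with the fixed count of $M + (N - 1)M^2$. Unlike the single-expansion
step of the FSE, the sort couples all branches at each level, so the levels
cannot be processed independently per branch.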

IV. Simulation Results and Discussions 

In this section, we optimize the size of the set of candidates
$\mathcal{A}$. Then, we evaluate the bit error rate (BER) performance of the
proposed BD vector perturbation techniques in $(n_T, n_R, n_U)$ MU-MIMO
systems, with $n_T = n_U \times n_R$. The conventional THP approach is
considered as the special case of the vector perturbation techniques in
which only the branch that has the least accumulative metric is retained at
each tree-search level. Also, due to its superior performance compared to
the ZF criterion, the MMSE criterion is used to construct the precoding
matrix $\boldsymbol{G}_{\text{eff},i}$.

Fig. 4 depicts the BER performance of the proposed BD vector perturbation
techniques in (8,2,4) and (8,4,2) MU-MIMO systems at SNRs of 20 and 25 dB,
respectively, for several values of $T$ and using 4-QAM. We remark that for
both the BD-FSE and BD-QRDM-E techniques, the largest improvement is
achieved when moving from $(T = 3)$ to $(T = 5)$. In the case of the BD-FSE,
no additional improvement in the BER performance is achieved for $T > 7$. On
the other hand, a small additional improvement is achieved in the case of
the BD-QRDM-E algorithm for $T > 7$. As a tradeoff between performance and
complexity, we use $T = 7$ in the sequel; that is,

$$\mathcal{A} = \{-3, -2, \ldots, 2, 3\}. \tag{15}$$

Fig. 4. Choice of the number of elements in the set of candidates A using
4-QAM.

Fig. 5. BER performance of the proposed BD vector perturbation techniques
for T = 7, and using 4-QAM modulation.

Fig. 5 depicts the BER performance of the proposed BD vector perturbation
techniques for several system configurations. BD-QRDM-E outperforms BD-FSE
for all system configurations. For instance, at a target BER of $10^{-4}$,
BD-QRDM-E outperforms BD-FSE by 1.65, 2.1, and 1.7 dB in the (8,1,8),
(8,2,4), and (8,4,2) MU-MIMO systems, respectively. The advantage of the
BD-FSE compared to the BD-QRDM-E is that it has a parallel tree search,
leading to a reduction in the latency of the vector perturbation stage.

Fig. 6. BER performance of the conventional MMSE-THP with BD for T = 7, and
using 4-QAM modulation.

Fig. 6 shows the BER performance of the introduced BD-THP algorithm. At a
BER of $10^{-4}$, BD-THP lags the performance of the BD-QRDM-E and BD-FSE
algorithms by 7.4 and 5.2 dB, respectively, in the (8,2,4) MU-MIMO system.
This degradation is due to the non-optimality of the obtained perturbation
vector. Furthermore, we remark that a floor in the BER performance of the
BD-THP scheme appears at high SNR values, which is also due to the
non-optimality of the vector perturbation stage.

At high SNR values, the slope of the BER curves is directly proportional to
the achieved diversity order. From Figs. 5 and 6, we remark that the
diversity orders achieved by the proposed techniques are equivalent and
linearly proportional to the number of receive antennas per user. In
contrast, the diversity order attained by the conventional BD-THP technique
tends to unity, even though an improvement in the BER is achieved when the
number of receive antennas per user is increased for a fixed $n_T$.

V. Conclusions 

In this paper, we proposed the combination of the fixed-complexity FSE and
QRDM-E multiuser vector perturbation techniques with the block
diagonalization algorithm. The block diagonalization transforms the MU-MIMO
channel into parallel SU-MIMO channels with zero inter-user interference.
The FSE or QRDM-E technique is used in the vector perturbation stage, which
aims to reduce the transmission power. In the proposed algorithms, a
tradeoff between computational complexity and performance is achieved by
controlling the size of the set of candidates at the vector perturbation
stage. Using extensive simulations, the optimum size of the set of
candidates is obtained. Also, due to its parallel tree-search stage, the FSE
can be pipelined, leading to a tremendous reduction in the precoding
latency. Therefore, the proposed algorithms are implementation-efficient
compared with the conventional BD-SE technique, which has a random
complexity and a sequential tree-search stage. In terms of BER performance,
the proposed techniques outperform the conventional BD-THP technique by more
than 5 dB in the (8,2,4) MU-MIMO system. Therefore, due to their low and
fixed complexities, the proposed algorithms are strong candidates for
implementation in future communication systems.